
fix tpd panic loop: cxo Finc-to-negative + httputil short-write discriminator #2422

Merged
0pcom merged 1 commit into skycoin:develop from 0pcom:fix/tpd-panics-cxo-finc-and-httputil-shortwrite
May 4, 2026

Conversation


0pcom (Collaborator) commented May 4, 2026

Summary

Production TPD has been restarting every 30-40 seconds (RestartCount: 556 over 6 hours on the prod host running develop tip 96ee3806c). Two distinct panics, found by grepping the container logs, are tearing down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc to negative

Stack:

panic: Finc to negative for: <hash>
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Cache).Finc cache.go:1189
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Filler).apply fill.go:225
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Filler).Run.func1 fill.go:297

The filling-refcount goes below zero — likely a duplicate Finc on a Filler.incs map, or an Inc/Finc mismatch across overlapping fillers. The historical behavior was to panic, which became visible after #2420 increased the rate of fill churn (or the bug was always latent and load surfaced it).

Filler.apply and Filler.reject already consume Finc's error return and just log:

if err := f.c.Finc(key, inc); err != nil {
    log.Printf("[CXO] filler apply: failed to increment ref count for %s: %v", key.Hex(), err)
}

So surfacing the inconsistency through that path costs nothing extra. The fix clamps fc to 0, logs the offending key + inc, and continues. Worst case is a leaked filling-item slot — orders of magnitude better than a process kill every 30 seconds.
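Concretely, the change inside (*Cache).Finc replaces the panic with a clamp along these lines (a sketch, not the literal diff; `item` and its `fc` field are assumed names for the filling-item record around cache.go:1189):

item.fc += inc
if item.fc < 0 {
    // Clamp instead of panicking. Filler.apply / Filler.reject already
    // log the returned error, so the inconsistency stays visible
    // without tearing down the process.
    item.fc = 0
    return fmt.Errorf("Finc to negative for: %s (inc %d), clamped to 0",
        key.Hex(), inc)
}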

2. pkg/httputil/httputil.go WriteJSON panic on slow-client timeout

Stack:

panic: short write: i/o deadline reached
github.com/skycoin/skywire/pkg/httputil.WriteJSON httputil.go:50
github.com/skycoin/skywire/pkg/transport-discovery/api.(*API).getAllTransports endpoints.go:319

getAllTransports writes ~1MB of JSON. On a slow client, net/http's per-write deadline expires mid-response, and the response writer returns an error whose message is "short write: i/o deadline reached". The message matches io.ErrShortWrite's text, but the value is net/http's timeoutWriter's own error, so errors.Is(err, io.ErrShortWrite) returns false. The fallback string match in isIOError only checked "broken pipe" and "connection reset", so the discriminator returned false and WriteJSON ran the panic branch.

Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. TestIsIOErrorShortWriteVariants pins both sentinel and string-match cases.

Test plan

  • go build ./... clean.
  • go vet ./pkg/cxo/... ./pkg/httputil/... clean.
  • go test ./pkg/httputil/ ./pkg/cxo/skyobject/ ./pkg/transport-discovery/... all pass.
  • gofmt clean.
  • New unit test TestIsIOErrorShortWriteVariants covers nil, all four current sentinels, the production short-write case, the i/o-timeout / deadline-exceeded text variants, and the existing broken-pipe / connection-reset cases (a sketch follows this list).
  • Post-deploy: TPD container RestartCount no longer climbing; docker logs transport-discovery | grep -c panic: reaches 0 within a few minutes.
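A table-driven sketch of what that test plausibly pins, with the case list inferred from the bullet above (the real test may differ):

package httputil

import (
    "errors"
    "io"
    "testing"
)

func TestIsIOErrorShortWriteVariants(t *testing.T) {
    cases := []struct {
        name string
        err  error
        want bool
    }{
        {"nil", nil, false},
        {"sentinel short write", io.ErrShortWrite, true},
        {"production short write", errors.New("short write: i/o deadline reached"), true},
        {"i/o timeout text", errors.New("read tcp 10.0.0.1:443: i/o timeout"), true},
        {"deadline exceeded text", errors.New("context deadline exceeded"), true},
        {"broken pipe", errors.New("write tcp: broken pipe"), true},
        {"connection reset", errors.New("read tcp: connection reset by peer"), true},
        {"unrelated error", errors.New("something else"), false},
    }
    for _, tc := range cases {
        if got := isIOError(tc.err); got != tc.want {
            t.Errorf("%s: isIOError(%v) = %v, want %v", tc.name, tc.err, got, tc.want)
        }
    }
}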

Notes

Neither bug is caused by the recent latency PRs (#2415, #2418, #2421); the panics' stacks make that explicit. They've just been hammering the deployment more visibly since latency persistence and outlier filtering landed back-to-back, because every TPD bounce now drops registration TTLs on visors that haven't re-pushed yet. Fixing these two restores normal TPD uptime.

fix tpd panic loop: cxo Finc-to-negative + httputil short-write discriminator

Production TPD was restarting every 30-40s (RestartCount 556 over 6h
on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero — likely a duplicate Finc on
   a Filler.incs map, or an Inc/Finc mismatch across overlapping
   fillers. Hard process kill via panic. Filler.apply / Filler.reject
   already consume Finc's error return and just log; surface the
   inconsistency through that path instead. Clamp fc to 0, log the
   condition with key and the offending inc, and continue. Worst
   case is a leaked filling-item slot — orders of magnitude better
   than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but
   net/http's timeoutWriter returns its own error value with the
   same message string when a write deadline expires mid-response.
   errors.Is misses it (different sentinel), the fallback string
   match didn't include "short write", so getAllTransports'
   ~1MB JSON write to a slow client panics on every deadline hit.
   Added "short write", "i/o timeout", and "deadline exceeded"
   to the string-match fallback. New TestIsIOErrorShortWriteVariants
   pins all sentinel and string-match cases.

Neither bug is caused by skycoin#2415/skycoin#2418/skycoin#2421; they were just made
visible because deploys cycling at 30-40s aren't subtle. Together
these stop the panic loop without changing any data semantics.
0pcom merged commit 80fef71 into skycoin:develop May 4, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
… (#2423)

#2422 silenced the two dominant TPD panics ('Finc to negative',
'short write: i/o deadline reached'). RestartCount went from
~556/6h to 2/10min. The remaining two panics are nil-derefs in
pkg/cxo/node/head.go from torn-down fillHead state:

1. createFiller line 338: cr.c.String() panics when cr.c is nil.
   handleDelConn (line 517) sets f.p.c = nil when the source
   connection is removed. f.p.r remains non-zero, so the
   f.p == (connRoot{}) gate at handleFillingResult does NOT catch
   the nil-c case, and createFiller(f.p) runs with a nil source.
   Fix: nil-c short-circuit at the top of createFiller. The fill
   can't proceed without a source connection — log and return.

2. handleSuccess line 228 (and the parallel sites in handleRequest,
   handleRequestFailure, handleReceivedRoot): f.fc.PushBack on a
   nil *list.List. closeFiller (line 384) sets
   `f.rqo, f.fc, f.rq = nil, nil, nil`, so any in-flight Filler
   goroutine messages that drain after close land on a nil list.
   Fix: nil-guard each PushBack/PushFront site (4 sites total).
   Same recovery as #2422's panic→drop pattern: the closed filler's
   work is no longer relevant.

These are the actual races; the right "real" fix would be to drain
or signal-shutdown the channels before nilling state, but that's a
larger structural change. Guards stop the process kill while
preserving observability.
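The second guard in that commit is the same drop-instead-of-panic pattern as this PR. A self-contained illustration of the nil-list guard (fillerState and push are illustrative stand-ins, not the real pkg/cxo/node types):

package main

import (
    "container/list"
    "log"
)

// fillerState loosely mirrors the fields named in the commit: fc is the
// filler's result queue, which closeFiller nils out.
type fillerState struct {
    fc *list.List
}

// push is the guarded PushBack pattern from fix 2: a Filler message that
// drains after closeFiller has nilled the list is logged and dropped
// instead of panicking on a nil *list.List.
func (f *fillerState) push(v any) {
    if f.fc == nil {
        log.Printf("[CXO] filler already closed, dropping %v", v)
        return
    }
    f.fc.PushBack(v)
}

func main() {
    f := &fillerState{fc: list.New()}
    f.push("in-flight result") // queued normally
    f.fc = nil                 // closeFiller tears down the queue
    f.push("stale result")     // guarded: logged and dropped, no panic
}

The nil-c short-circuit in createFiller (fix 1) is the analogous early return: check the source connection before using it, log, and bail out.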